The data used in this lab come from OpenML. In particular, we use the Phoneme data set, whose description can be found here: https://www.openml.org/d/1489.
phoneme = read.csv("phoneme.csv", stringsAsFactors=TRUE)
head(phoneme)
## V1 V2 V3 V4 V5 Class
## 1 1.7494 -0.0504 -0.4321 0.7681 -0.5974 Nasal
## 2 1.9682 -0.8207 -0.2183 0.0900 0.4641 Nasal
## 3 -0.1535 -0.1977 1.1861 0.1394 -0.6372 Oral
## 4 -0.3297 -0.2271 0.3609 -1.5044 -1.9235 Oral
## 5 2.6405 -0.7292 -1.0152 -0.3963 -0.2138 Nasal
## 6 -0.3896 -0.1427 1.1188 0.7823 1.6400 Oral
The data set has 5 numeric predictors and one response variable, Class, which has two levels: Nasal (nasal vowels) and Oral (oral vowels). The numeric predictors have all been standardised to mean 0 and standard deviation 1 (and then rounded to 4 d.p.), as in the original data set. A pairwise scatterplot of the data looks like the following:
with(phoneme, pairs(phoneme[,-6], col=c(2,4)[Class], pch=c(1,3)[Class]))
Library needed for this lab:
library(mclust)
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
Write R code to reproduce the plot above, following the description. Make sure that the majority class is determined by code automatically (i.e., not by eye), so the code can be reused easily.
table(phoneme$Class)
##
## Nasal Oral
## 709 291
phoneme_nasal <- phoneme[phoneme$Class=="Nasal",]
phoneme_oral <- phoneme[phoneme$Class=="Oral",]
da = rbind(phoneme_nasal[,-6], phoneme_oral[,-6])
pairs(da, col = c(rep(2, nrow(phoneme_nasal)), rep(4, nrow(phoneme_oral))),
      pch = c(rep(1, nrow(phoneme_nasal)), rep(3, nrow(phoneme_oral))))
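The question asks for the majority class to be found by code, not by eye. One way to satisfy this is to look it up with `which.max()` and reorder the rows accordingly; a sketch, where `sizes`, `maj` and `da2` are names of our own choosing:

```r
# Determine the majority class programmatically, not by eye
sizes <- table(phoneme$Class)
maj <- names(which.max(sizes))   # evaluates to "Nasal" for this data set
da2 <- rbind(phoneme[phoneme$Class == maj, -6],
             phoneme[phoneme$Class != maj, -6])
pairs(da2,
      col = c(rep(2, max(sizes)), rep(4, sum(sizes) - max(sizes))),
      pch = c(rep(1, max(sizes)), rep(3, sum(sizes) - max(sizes))))
```

Because the class name is computed rather than hard-coded, the same code would work unchanged on a data set where Oral happened to be the majority class.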
par(mfrow=c(2,2))
for(bw in c("nrd0", "ucv", "bcv", "SJ")) {
  hist(phoneme$V1, freq=FALSE, breaks=100,
       col="gray80", border="white",
       main=paste0("h = ", bw), xlab="V1")
  lines(density(phoneme$V1, bw=bw), col=4, lwd=2)
}
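The panel titles above show the name of each bandwidth selector rather than the bandwidth itself. If the numeric value of h is wanted, it can be recovered from the `density` object (a quick check, assuming `phoneme` is already loaded):

```r
# Recover the numeric bandwidth h chosen by each selector for V1
sapply(c("nrd0", "ucv", "bcv", "SJ"),
       function(bw) density(phoneme$V1, bw = bw)$bw)
```

The `bw` component of the returned object always holds the bandwidth actually used, whichever selector was named in the call.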
(r = densityMclust(phoneme$V1, G=1:20, modelNames=c("E","V"))) # Let BIC decide
## 'densityMclust' model object: (V,6)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
## [17] "density"
summary(r)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust V (univariate, unequal variance) model with 6 components:
##
## log-likelihood n df BIC ICL
## -939.6717 1000 17 -1996.775 -2467.617
plot(r, phoneme$V1, "density", breaks=100, lwd=2, col=2)
(r_E1 = densityMclust(phoneme_nasal[, -6], G=1:9, modelNames=c("EEE"))) # Let BIC decide
## 'densityMclust' model object: (EEE,9)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
## [17] "density"
summary(r_E1)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 9
## components:
##
## log-likelihood n df BIC ICL
## -3768.155 709 68 -7982.651 -8102.045
(r_E2 = densityMclust(phoneme_oral[, -6], G=1:9, modelNames=c("EEE"))) # Let BIC decide
## 'densityMclust' model object: (EEE,8)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
## [17] "density"
summary(r_E2)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 8
## components:
##
## log-likelihood n df BIC ICL
## -1495.103 291 62 -3341.952 -3397.139
For class Nasal, the best density estimate in the equal-variance (EEE) subfamily of multivariate normal mixtures, with the number of components ranging from 1 to 9, is (EEE,9) according to the BIC.
For class Oral, the corresponding best density estimate is (EEE,8).
plot(r_E1, phoneme_nasal[, -6], what="density", col=4, points.col="grey")
plot(r_E2, phoneme_oral[, -6], what="density", col=4, points.col="grey")
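To see why the BIC settled on 9 components for Nasal and 8 for Oral, the BIC trace over G can be plotted directly from the fitted objects; a sketch:

```r
# BIC as a function of the number of components, for each class
par(mfrow = c(1, 2))
plot(r_E1, what = "BIC"); title(sub = "Nasal, EEE")
plot(r_E2, what = "BIC"); title(sub = "Oral, EEE")
```

The full BIC table is also available numerically as `r_E1$BIC` and `r_E2$BIC`.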
(r_V1 = densityMclust(phoneme_nasal[, -6], G=1:9, modelNames=c("VVV"))) # Let BIC decide
## 'densityMclust' model object: (VVV,9)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
## [17] "density"
summary(r_V1)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 9
## components:
##
## log-likelihood n df BIC ICL
## -2507.006 709 188 -6248.017 -6286.289
(r_V2 = densityMclust(phoneme_oral[, -6], G=1:9, modelNames=c("VVV"))) # Let BIC decide
## 'densityMclust' model object: (VVV,3)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
## [17] "density"
summary(r_V2)
## -------------------------------------------------------
## Density estimation via Gaussian finite mixture modeling
## -------------------------------------------------------
##
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3
## components:
##
## log-likelihood n df BIC ICL
## -1453.461 291 62 -3258.669 -3296.499
For class Nasal, the best density estimate in the varying-variance (VVV) subfamily of multivariate normal mixtures, with the number of components ranging from 1 to 9, is (VVV,9) according to the BIC.
For class Oral, the corresponding best density estimate is (VVV,3).
plot(r_V1, phoneme_nasal[, -6], what="density", col=4, points.col="grey")
plot(r_V2, phoneme_oral[, -6], what="density", col=4, points.col="grey")
p_E1 <- predict(r_E1, phoneme[, -6])
head(p_E1)
## [1] 1.945426e-03 3.735163e-03 5.198055e-03 2.702895e-05 2.643086e-02
## [6] 3.356227e-04
p_E2 <- predict(r_E2, phoneme[, -6])
head(p_E2)
## [1] 1.923916e-13 9.998861e-16 1.325117e-02 2.587213e-04 1.289425e-22
## [6] 3.095459e-02
# Assign each observation to the class with the larger estimated density
# (these are densities, not posterior probabilities; equal priors are implicitly assumed)
yhat_E <- ifelse(p_E1>p_E2, 1, 2)
head(yhat_E) # 1=Nasal, 2=Oral
## [1] 1 1 2 2 1 2
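Comparing the two densities directly amounts to assuming equal class priors. As an aside (not part of the lab's stated rule), a Bayes-rule variant would weight each density by the empirical class proportions before comparing; a sketch, with `prior` and `yhat_E_prior` as our own names:

```r
# Weight each class density by its empirical prior before comparing
prior <- table(phoneme$Class) / nrow(phoneme)   # Nasal 0.709, Oral 0.291
yhat_E_prior <- ifelse(prior["Nasal"] * p_E1 > prior["Oral"] * p_E2, 1, 2)
table(yhat_E_prior)
```

Since Nasal is more than twice as common as Oral, this weighting shifts some borderline observations towards the Nasal class.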
phoneme2 = phoneme
phoneme2$Class <- ifelse(phoneme2$Class == "Nasal", 1L, 2L)
head(phoneme2)
## V1 V2 V3 V4 V5 Class
## 1 1.7494 -0.0504 -0.4321 0.7681 -0.5974 1
## 2 1.9682 -0.8207 -0.2183 0.0900 0.4641 1
## 3 -0.1535 -0.1977 1.1861 0.1394 -0.6372 2
## 4 -0.3297 -0.2271 0.3609 -1.5044 -1.9235 2
## 5 2.6405 -0.7292 -1.0152 -0.3963 -0.2138 1
## 6 -0.3896 -0.1427 1.1188 0.7823 1.6400 2
(train_error = mean(phoneme2$Class != yhat_E)) # resubstitution misclassification rate
## [1] 0.199
“EEE” resubstitution misclassification rate: 19.9%
p_V1 <- predict(r_V1, phoneme[, -6])
head(p_V1)
## [1] 3.084490e-03 3.083380e-03 3.961820e-04 1.233538e-09 5.004490e-01
## [6] 1.956755e-04
p_V2 <- predict(r_V2, phoneme[, -6])
head(p_V2)
## [1] 1.022265e-07 2.254148e-09 1.383541e-02 5.228346e-05 3.060883e-14
## [6] 3.265301e-03
# Again, assign each observation to the class with the larger estimated density
yhat_V <- ifelse(p_V1>p_V2, 1, 2)
head(yhat_V) # 1=Nasal, 2=Oral
## [1] 1 1 2 2 1 2
(train_error = mean(phoneme2$Class != yhat_V)) # resubstitution misclassification rate
## [1] 0.164
“VVV” resubstitution misclassification rate: 16.4%
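Beyond the overall error rates, a confusion matrix shows where each mixture classifier disagrees with the given labels; a sketch using the objects defined above:

```r
# Rows: true class (1=Nasal, 2=Oral); columns: predicted class
table(true = phoneme2$Class, EEE = yhat_E)
table(true = phoneme2$Class, VVV = yhat_V)
```

This makes it easy to check whether the errors fall mostly on the minority Oral class.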
(r_K2 = kmeans(phoneme[, -6], centers=2)) # K = 2
## K-means clustering with 2 clusters of sizes 396, 604
##
## Cluster means:
## V1 V2 V3 V4 V5
## 1 0.8236333 0.3351551 -0.8928588 -0.3628811 -0.12521818
## 2 -0.5399972 -0.2197373 0.5853838 0.2379156 0.08210265
##
## Clustering vector:
## [1] 1 1 2 2 1 2 2 1 1 1 2 1 2 2 1 2 2 1 1 1 2 2 1 2 2 2 2 2 1 2 2 1 1 1 2 1 1
## [38] 2 2 2 2 2 2 1 1 1 2 1 1 2 1 2 2 1 1 2 2 2 2 2 1 1 1 1 2 2 2 1 2 1 2 1 2 1
## [75] 2 1 1 1 2 1 2 1 1 2 2 2 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 1 1 1 1 2 1 1 2 1 1
## [112] 2 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2
## [149] 2 1 2 1 2 1 2 1 2 1 1 2 1 1 2 1 1 2 2 1 1 1 2 1 1 1 2 2 2 2 1 1 1 2 1 2 1
## [186] 1 2 1 1 2 1 2 1 1 2 2 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1 1 2 1 1 2 2 2 1 1 1
## [223] 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 2 1 2
## [260] 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 2 1 2 1 1 1 2 2 2 2 1 1 2
## [297] 2 1 1 1 1 1 2 1 1 2 2 1 2 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1 2 1 2 1 1 2 1 2 2
## [334] 2 1 2 1 1 2 2 2 2 1 1 1 2 1 1 2 1 2 1 1 2 2 2 2 1 1 2 1 1 2 1 1 2 2 1 2 2
## [371] 2 1 2 1 1 1 2 1 1 2 1 1 1 2 2 2 2 1 1 1 1 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 2
## [408] 2 2 1 2 2 1 1 2 1 1 1 1 1 2 2 1 2 2 1 2 1 1 2 1 1 2 1 1 2 1 2 2 2 2 1 1 2
## [445] 2 1 2 2 1 1 1 2 1 1 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 1 1 2 2 2 2 2
## [482] 1 1 2 1 2 2 2 1 1 2 1 2 2 2 2 1 1 1 2 2 2 2 1 1 1 2 1 2 1 2 2 2 2 2 2 1 1
## [519] 1 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 1 1 1 1 2 2 1 2 1 1 2 2
## [556] 2 2 2 1 1 2 1 2 2 2 2 1 2 1 1 1 2 1 1 2 1 1 1 2 1 2 2 2 1 2 1 2 2 1 1 2 1
## [593] 1 1 2 1 2 1 2 1 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 1 1 2 1 1 1 2 2 1 2 1 1 2 1
## [630] 2 1 1 2 2 1 2 2 2 1 1 2 2 2 2 1 1 2 1 1 2 1 2 1 2 1 1 2 2 2 1 2 1 1 1 2 2
## [667] 2 2 2 1 1 1 1 2 2 2 1 1 1 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 1 2 2 2 2 1 2 1 2
## [704] 2 1 2 1 1 2 1 1 2 2 1 2 1 2 2 1 2 2 2 2 1 1 2 2 1 1 2 2 1 1 1 2 2 2 2 1 1
## [741] 1 2 2 1 1 2 1 1 1 1 2 2 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1
## [778] 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1
## [815] 2 1 1 2 2 1 2 2 1 2 1 2 2 1 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 1 2 2 2 1 2 2
## [852] 1 2 2 1 2 2 1 1 2 1 2 1 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 2 1 1
## [889] 2 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2 1 1 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [926] 2 2 1 1 2 2 1 2 1 2 2 2 2 1 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1
## [963] 2 2 1 1 2 1 2 2 2 2 1 1 1 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2 1 2 1 2 2 1 2 2
## [1000] 1
##
## Within cluster sum of squares by cluster:
## [1] 1356.678 2500.634
## (between_SS / total_SS = 22.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
table(r_K2$cluster) # 1=Nasal, 2=Oral
##
## 1 2
## 396 604
# Colour and mark each observation by its K-means cluster label
pairs(phoneme[, -6], col=c(2,4)[r_K2$cluster], pch=c(1,3)[r_K2$cluster])
Compute the adjusted Rand indices for K = 2, …, 9 clusters as found by the K-means method, contrasted with the given class labels. Comment on the results.
set.seed(769)
ari = double(8)
for(k in 2:9) {
  r = kmeans(phoneme2[, -6], centers=k)
  ari[k-1] = adjustedRandIndex(phoneme2$Class, r$cluster)
}
ari
## [1] 0.20997134 0.11711319 0.16520438 0.09069357 0.06728827 0.04566082 0.04665725
## [8] 0.04944605
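A quick plot makes the trend in the ARI values easier to see; a sketch:

```r
# ARI versus number of K-means clusters
plot(2:9, ari, type = "b", xlab = "K", ylab = "Adjusted Rand Index",
     main = "K-means vs. given class labels")
```

The agreement with the true two-class structure is modest at K = 2 and generally decays as K grows.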
(r_M2 = Mclust(phoneme[,-6], G=2, modelNames="VVV"))
## 'Mclust' model object: (VVV,2)
##
## Available components:
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "loglik"
## [9] "df" "bic" "icl" "hypvol"
## [13] "parameters" "z" "classification" "uncertainty"
p_M2 <- predict(r_M2)$classification # clustering
table(p_M2) # 1=Oral, 2=Nasal
## p_M2
## 1 2
## 329 671
# Colour and mark each observation by its Mclust cluster label
pairs(phoneme[, -6], col=c(2,4)[p_M2], pch=c(1,3)[p_M2])
set.seed(769)
ari = double(8)
phoneme3 = phoneme
phoneme3$Class <- ifelse(phoneme3$Class == "Oral", 1L, 2L)
head(phoneme3)
## V1 V2 V3 V4 V5 Class
## 1 1.7494 -0.0504 -0.4321 0.7681 -0.5974 2
## 2 1.9682 -0.8207 -0.2183 0.0900 0.4641 2
## 3 -0.1535 -0.1977 1.1861 0.1394 -0.6372 1
## 4 -0.3297 -0.2271 0.3609 -1.5044 -1.9235 1
## 5 2.6405 -0.7292 -1.0152 -0.3963 -0.2138 2
## 6 -0.3896 -0.1427 1.1188 0.7823 1.6400 1
for(k in 2:9) {
  r = Mclust(phoneme3[, -6], G=k, modelNames="VVV")
  ari[k-1] = adjustedRandIndex(phoneme3$Class, predict(r)$classification)
}
ari
## [1] 0.02963741 0.05014537 0.07091669 0.05169660 0.03658624 0.03973634 0.03236527
## [8] 0.03572984
d = dist(phoneme[,-6]) # pairwise Euclidean distances
class(d)
## [1] "dist"
head(d)
## [1] 1.507829 2.580338 3.450439 1.760060 3.463422 3.110983
cex = 1.5
r_C = hclust(d) # complete linkage, by default
names(r_C)
## [1] "merge" "height" "order" "labels" "method"
## [6] "call" "dist.method"
plot(r_C, cex.axis=cex, cex.lab=cex, cex.main=cex) # dendrogram
r_S = hclust(d, method="single") # single linkage
names(r_S)
## [1] "merge" "height" "order" "labels" "method"
## [6] "call" "dist.method"
plot(r_S, cex.axis=cex, cex.lab=cex, cex.main=cex) # dendrogram
Complete linkage:
• Uses the largest pairwise dissimilarity between observations.
• Tends to produce compact clusters.
• Some observations in a cluster may be closer to observations in another cluster.
Single linkage:
• Uses the smallest pairwise dissimilarity between observations.
• Close, immediate observations are linked first.
• May produce extended, chain-shaped clusters, with almost no gaps between observations.
In this case, the single-linkage dendrogram is far more unbalanced than the complete-linkage one, and its merge heights are much closer together. This is because single linkage joins clusters at the smallest pairwise dissimilarity, so observations tend to be absorbed one at a time (chaining), whereas complete linkage uses the largest pairwise dissimilarity and so tends to merge compact, well-separated groups.
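Other linkage methods sit between these two extremes. As an extra comparison (not part of the output above), average and Ward linkage can be run on the same distance matrix and scored against the given labels; a sketch:

```r
# Compare additional linkage methods via ARI at K = 2
for (m in c("average", "ward.D2")) {
  r_m <- hclust(d, method = m)
  cat(m, ":", adjustedRandIndex(phoneme2$Class, cutree(r_m, 2)), "\n")
}
```

`"average"` and `"ward.D2"` are standard `method` options of `hclust`; `adjustedRandIndex` comes from the mclust package already loaded.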
p_C2 <- cutree(r_C, 2) # cluster labels of observations for K = 2
table(p_C2) # 1=Nasal, 2=Oral
## p_C2
## 1 2
## 825 175
set.seed(769)
ari = double(8)
r = hclust(d) # complete linkage, by default; fit once, cut at each K
for(k in 2:9) {
  ari[k-1] = adjustedRandIndex(phoneme2$Class, cutree(r, k))
}
ari
## [1] 0.15231471 0.08742364 0.04782010 0.06935799 0.07016402 0.07475356 0.08060478
## [8] 0.08153396
p_C2 <- cutree(r_S, 2) # cluster labels of observations for K = 2
table(p_C2) # one cluster is a singleton: single linkage chains almost everything together
## p_C2
## 1 2
## 999 1
set.seed(769)
ari = double(8)
r = hclust(d, method="single") # single linkage; fit once, cut at each K
for(k in 2:9) {
  ari[k-1] = adjustedRandIndex(phoneme2$Class, cutree(r, k))
}
ari
## [1] -0.0011781335 0.0005002740 -0.0006730085 0.0021753775 0.0050229980
## [6] 0.0038415529 0.0083236625 0.0143949554
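The near-zero ARIs reflect single linkage's chaining behaviour: cutting the tree mostly peels off individual observations rather than splitting the data into balanced groups. The size of the largest cluster at each K makes this explicit; a sketch:

```r
# Size of the largest cluster when cutting the single-linkage tree at K = 2, ..., 9
sapply(2:9, function(k) sort(table(cutree(r_S, k)), decreasing = TRUE)[1])
```

If the largest cluster stays close to n = 1000 for every K, the cut is clearly not recovering the two phoneme classes.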
heatmap(as.matrix(phoneme[, -6]), scale="column", distfun=dist, hclustfun=hclust,
margins=c(3,4))
heatmap(as.matrix(phoneme[, -6]), scale="column", distfun=dist,
        hclustfun=function(x) hclust(x, method="single"), margins=c(3,4))
In this lab, we explored the Phoneme data set from OpenML, whose description can be found here: https://www.openml.org/d/1489.
The response variable is Class, and we used several unsupervised learning methods to try to recover it.
We gained experience with unsupervised learning problems, basic density estimation and clustering methods, and performance evaluation.
Specifically, we applied:
• kernel density estimation with several bandwidth selectors ("nrd0", "ucv", "bcv", "SJ");
• Gaussian mixture density estimation with mclust (EEE and VVV models), used both per class and for classification;
• K-means clustering;
• model-based clustering with Mclust;
• hierarchical clustering with complete and single linkage;
• the adjusted Rand index for comparing clusterings with the given class labels.
We also applied visualisation techniques (pairwise scatterplots, dendrograms, and heatmaps) to present the clusters.